Can VLMs Detect and Localize Fine-Grained AI-Edited Images?
Sun, Zhen, Zhang, Ziyi, Luo, Zeren, Zhong, Zhiyuan, Sha, Zeyang, Cong, Tianshuo, Li, Zheng, Cui, Shiwen, Wang, Weiqiang, Wei, Jiaheng, He, Xinlei, Li, Qi, Wang, Qian
Fine-grained detection and localization of localized image edits are crucial for assessing content authenticity, especially as modern diffusion models and image editors can produce highly realistic manipulations. However, this problem faces three key challenges: (1) most AIGC detectors produce only a global real-or-fake label without indicating where edits occur; (2) traditional computer vision methods for edit localization typically rely on costly pixel-level annotations; and (3) there is no large-scale, modern benchmark specifically targeting edited-image detection. To address these gaps, we develop an automated data-generation pipeline and construct FragFake, a large-scale benchmark of AI-edited images spanning multiple source datasets, diverse editing models, and several common edit types. Building on FragFake, we are the first to systematically study vision-language models (VLMs) for edited-image classification and edited-region localization. Our experiments show that pretrained VLMs, including GPT-4o, perform poorly on this task, whereas fine-tuned models such as Qwen2.5-VL achieve high accuracy and substantially higher object precision across all settings. We further explore GRPO-based RLVR training, which yields modest metric gains while improving the interpretability of model outputs. Ablation and transfer analyses reveal how data balancing, training size, LoRA rank, and training domain affect performance, and highlight both the potential and the limitations of cross-editor and cross-dataset generalization. We anticipate that this work will establish a solid foundation to facilitate and inspire subsequent research in multimodal content authenticity.
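The evaluation the abstract implies, a global edited/unedited decision plus a list of modified objects, is easy to make concrete. Below is a minimal sketch of such a pipeline; the `query_vlm` stub, the prompt wording, and the object-precision definition are our own assumptions, not FragFake's exact protocol:

```python
from dataclasses import dataclass

@dataclass
class Sample:
    image_path: str
    is_edited: bool           # ground-truth global label
    edited_objects: set[str]  # ground-truth names of modified objects

def query_vlm(image_path: str, prompt: str) -> str:
    """Hypothetical stand-in for any chat-style VLM API
    (e.g. a GPT-4o or Qwen2.5-VL endpoint)."""
    raise NotImplementedError

def evaluate(samples: list[Sample]) -> tuple[float, float]:
    """Return (classification accuracy, mean object precision)."""
    correct, precisions = 0, []
    for s in samples:
        answer = query_vlm(
            s.image_path,
            "Was this image edited by an AI model? If yes, list the "
            "modified objects, comma-separated. If no, answer 'no'.",
        )
        predicted_edited = not answer.strip().lower().startswith("no")
        correct += predicted_edited == s.is_edited
        if predicted_edited and s.is_edited:
            predicted = {o.strip().lower() for o in answer.split(",") if o.strip()}
            truth = {o.lower() for o in s.edited_objects}
            if predicted:
                # fraction of claimed edited objects that are truly edited
                precisions.append(len(predicted & truth) / len(predicted))
    accuracy = correct / len(samples)
    object_precision = sum(precisions) / len(precisions) if precisions else 0.0
    return accuracy, object_precision
```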
3D-Consistent Multi-View Editing by Diffusion Guidance
Bengtson, Josef, Nilsson, David, Lee, Dong In, Kahl, Fredrik
Recent advancements in diffusion models have greatly improved text-based image editing, yet methods that edit images independently often produce geometrically and photometrically inconsistent results across different views of the same scene. Such inconsistencies are particularly problematic for editing of 3D representations such as NeRFs or Gaussian Splat models. We propose a training-free diffusion framework that enforces multi-view consistency during the image editing process. The key assumption is that corresponding points in the unedited images should undergo similar transformations after editing. To achieve this, we introduce a consistency loss that guides the diffusion sampling toward coherent edits. The framework is flexible and can be combined with widely varying image editing methods, supporting both dense and sparse multi-view editing setups. Experimental results show that our approach significantly improves 3D consistency compared to existing multi-view editing methods. We also show that this increased consistency enables high-quality Gaussian Splat editing with sharp details and strong fidelity to user-specified text prompts.
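The stated assumption, that corresponding points should undergo similar transformations, suggests a loss on per-point edit deltas across views. A minimal PyTorch sketch under that reading (the tensor layout, bilinear sampling via grid_sample, and the squared-error form are assumptions, not the authors' implementation):

```python
import torch
import torch.nn.functional as F

def consistency_loss(orig_a, edit_a, orig_b, edit_b, pts_a, pts_b):
    """Penalize differing edit deltas at corresponding points of two views.

    orig_*/edit_*: images, shape (1, C, H, W), values in [0, 1].
    pts_a, pts_b:  corresponding points, shape (1, N, 2), normalized to
                   [-1, 1] as expected by grid_sample.
    """
    def sample(img, pts):
        # grid_sample expects a (1, H_out, W_out, 2) grid; use a 1 x N "grid"
        return F.grid_sample(img, pts.unsqueeze(1), align_corners=True)

    delta_a = sample(edit_a - orig_a, pts_a)  # how view A changed at each point
    delta_b = sample(edit_b - orig_b, pts_b)  # how view B changed at each point
    return (delta_a - delta_b).pow(2).mean()
```

During sampling, the gradient of this loss with respect to the current estimates of the edited images would steer each denoising step toward cross-view coherence, in the spirit of guidance-based diffusion sampling.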
IE-Critic-R1: Advancing the Explanatory Measurement of Text-Driven Image Editing for Human Perception Alignment
Qu, Bowen, Sun, Shangkun, Liang, Xiaoyu, Gao, Wei
Recent advances in text-driven image editing have been significant, yet accurately evaluating the edited images remains a considerable challenge. Unlike the assessment of text-driven image generation, text-driven image editing is conditioned simultaneously on both a text prompt and a source image. The edited images often retain an intrinsic connection to the original image, which changes dynamically with the semantics of the text. However, previous methods tend to focus solely on text-image alignment or are not well aligned with human perception. In this work, we introduce the Text-driven Image Editing Benchmark suite (IE-Bench) to enhance the assessment of text-driven edited images. IE-Bench includes a database containing diverse source images, various editing prompts, and the corresponding edited results from different editing methods, with nearly 4,000 samples carrying Mean Opinion Scores (MOS) provided by 15 human subjects. Furthermore, we introduce IE-Critic-R1, which, benefiting from Reinforcement Learning from Verifiable Rewards (RLVR), provides more comprehensive and explainable quality assessment for text-driven image editing that aligns with human perception. Extensive experiments demonstrate IE-Critic-R1's superior subjective alignment on the text-driven image editing task compared with previous metrics. Related data and code are available to the public.
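RLVR presupposes a reward that can be checked programmatically. One plausible instantiation for this task, sketched below purely as an illustration (the output format and linear decay are our assumptions; the abstract does not specify IE-Critic-R1's reward), is to parse a scalar score from the critic's explanation and reward closeness to the human MOS:

```python
import re

def verifiable_reward(response: str, mos: float, scale: float = 5.0) -> float:
    """Reward in [0, 1]: 1 for an exact match with the MOS, decaying
    linearly with absolute error; 0 if no score can be parsed.

    Assumes the critic is prompted to end its explanation with a line
    like 'Score: 3.5' on a 0-to-`scale` rating scale.
    """
    match = re.search(r"Score:\s*([0-9]+(?:\.[0-9]+)?)", response)
    if match is None:
        return 0.0  # unparseable output earns nothing
    return max(0.0, 1.0 - abs(float(match.group(1)) - mos) / scale)
```

For example, a response ending in "Score: 4.0" against a MOS of 3.5 on a 5-point scale would earn a reward of 0.9.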
The Promise of RL for Autoregressive Image Editing
Ahmadi, Saba, Awal, Rabiul, Sikarwar, Ankur, Kazemnejad, Amirhossein, Luo, Ge Ya, Rodriguez, Juan A., Rajeswar, Sai, Reddy, Siva, Pal, Christopher, Krojer, Benno, Agrawal, Aishwarya
While image generation techniques are now capable of producing high-quality images that respect prompts spanning multiple sentences, the task of text-guided image editing remains a challenge. Even edit requests that consist of only a few words often fail to be executed correctly. We explore three strategies to enhance performance on a wide range of image editing tasks: supervised fine-tuning (SFT), reinforcement learning (RL), and Chain-of-Thought (CoT) reasoning. In order to study all these components in one consistent framework, we adopt an autoregressive multimodal model that processes textual and visual tokens in a unified manner. We find RL combined with a large multimodal LLM verifier to be the most effective of these strategies. As a result, we release EARL: Editing with Autoregression and RL, a strong RL-based image editing model that performs competitively on a diverse range of edits compared to strong baselines, despite using much less training data. Thus, EARL pushes the frontier of autoregressive multimodal models on image editing. We release our code, training data, and trained models at https://github.com/mair-lab/EARL.
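The winning recipe, RL with a multimodal LLM verifier, boils down to a standard policy-gradient loop over sampled edits. A heavily simplified sketch with a group-relative baseline (the `policy` and `verifier` interfaces are placeholders, and EARL's actual algorithm may differ):

```python
import torch

def rl_step(policy, verifier, optimizer, batch, num_samples=4):
    """One policy-gradient update for an autoregressive image editor.

    policy:   placeholder autoregressive multimodal model; policy.sample
              returns image tokens and their summed log-probability.
    verifier: placeholder callable scoring (source, instruction, tokens)
              with a scalar in [0, 1], e.g. a multimodal LLM judge.
    """
    losses = []
    for src, instruction in batch:
        rewards, logps = [], []
        for _ in range(num_samples):           # sample a group of edits
            tokens, logp = policy.sample(src, instruction)
            rewards.append(verifier(src, instruction, tokens))
            logps.append(logp)
        rewards = torch.tensor(rewards)
        advantages = rewards - rewards.mean()  # group-relative baseline
        losses.append(-(advantages * torch.stack(logps)).mean())
    loss = torch.stack(losses).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```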
Learning an Image Editing Model without Image Editing Pairs
Kumari, Nupur, Wang, Sheng-Yu, Zhao, Nanxuan, Nitzan, Yotam, Li, Yuheng, Singh, Krishna Kumar, Zhang, Richard, Shechtman, Eli, Zhu, Jun-Yan, Huang, Xun
Recent image editing models have achieved impressive results while following natural language editing instructions, but they rely on supervised fine-tuning with large datasets of input-target pairs. This is a critical bottleneck, as such naturally occurring pairs are hard to curate at scale. Current workarounds use synthetic training pairs that leverage the zero-shot capabilities of existing models. However, this can propagate and magnify the artifacts of the pretrained model into the final trained model. In this work, we present a new training paradigm that eliminates the need for paired data entirely. Our approach directly optimizes a few-step diffusion model by unrolling it during training and leveraging feedback from vision-language models (VLMs). For each input and editing instruction, the VLM evaluates if an edit follows the instruction and preserves unchanged content, providing direct gradients for end-to-end optimization. To ensure visual fidelity, we incorporate distribution matching loss (DMD), which constrains generated images to remain within the image manifold learned by pretrained models. We evaluate our method on standard benchmarks and include an extensive ablation study. Without any paired data, our method performs on par with various image editing diffusion models trained on extensive supervised paired data, under the few-step setting. Given the same VLM as the reward model, we also outperform RL-based techniques like Flow-GRPO.
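Read literally, the training loop unrolls the few-step editor end to end and mixes two signals: a VLM-derived score for instruction following and content preservation, and a distribution-matching regularizer. A schematic PyTorch sketch with placeholder modules (none of these interfaces or the weighting come from the paper):

```python
import torch

def training_step(editor, vlm_score, dmd_loss, optimizer,
                  src, instruction, lam=0.25, steps=4):
    """One update for a few-step editor trained without paired data.

    editor:    placeholder few-step diffusion editor; editor.step denoises
               once and is differentiated through (the unrolling).
    vlm_score: differentiable (or surrogate-gradient) VLM score for
               instruction following and preservation; higher is better.
    dmd_loss:  distribution-matching term keeping outputs on the image
               manifold of a pretrained model.
    """
    x = torch.randn_like(src)             # start the sampler from noise
    for t in reversed(range(steps)):      # unroll the few-step sampler
        x = editor.step(x, src, instruction, t)
    loss = -vlm_score(x, src, instruction) + lam * dmd_loss(x)
    optimizer.zero_grad()
    loss.backward()                       # gradients flow through the unroll
    optimizer.step()
    return loss.item()
```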
Kontinuous Kontext: Continuous Strength Control for Instruction-based Image Editing
Parihar, Rishubh, Patashnik, Or, Ostashev, Daniil, Babu, R. Venkatesh, Cohen-Or, Daniel, Wang, Kuan-Chieh
Instruction-based image editing offers a powerful and intuitive way to manipulate images through natural language. Yet, relying solely on text instructions limits fine-grained control over the extent of edits. We introduce Kontinuous Kontext, an instruction-driven editing model that provides a new dimension of control over edit strength, enabling users to adjust edits gradually from no change to a fully realized result in a smooth and continuous manner. Kontinuous Kontext extends a state-of-the-art image editing model to accept an additional input, a scalar edit strength, which is paired with the edit instruction to enable explicit control over the extent of the edit. To inject this scalar information, we train a lightweight projector network that maps the input scalar and the edit instruction to coefficients in the model's modulation space. To train our model, we synthesize a diverse dataset of image-edit-instruction-strength quadruplets using existing generative models, followed by a filtering stage to ensure quality and consistency. Kontinuous Kontext provides a unified approach for fine-grained control over edit strength in instruction-driven editing, from subtle to strong, across diverse operations such as stylization, attribute, material, background, and shape changes, without requiring attribute-specific training.
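The scalar-injection mechanism is concrete enough to sketch: a small projector maps the strength scalar and an instruction embedding to per-layer modulation coefficients. The dimensions and two-layer MLP below are illustrative guesses, not the paper's configuration:

```python
import torch
import torch.nn as nn

class StrengthProjector(nn.Module):
    """Map (edit strength, instruction embedding) to modulation coefficients,
    one vector per transformer block of the editor. All sizes are illustrative."""

    def __init__(self, instr_dim=768, hidden=512, n_layers=24, mod_dim=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(instr_dim + 1, hidden),  # +1 for the scalar strength
            nn.SiLU(),
            nn.Linear(hidden, n_layers * mod_dim),
        )
        self.n_layers, self.mod_dim = n_layers, mod_dim

    def forward(self, strength: torch.Tensor, instr_emb: torch.Tensor):
        # strength: (B, 1) in [0, 1]; instr_emb: (B, instr_dim)
        coeffs = self.mlp(torch.cat([instr_emb, strength], dim=-1))
        return coeffs.view(-1, self.n_layers, self.mod_dim)
```

Training on image-edit-instruction-strength quadruplets would then teach the editor that strength 0 reproduces the input and strength 1 yields the fully realized edit.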